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ABSTRACT 

Motivation: Next Generation Sequencing technologies revolutionized 
many fields in biology by enabling the fast and cheap sequencing 
of large amounts of genomic data. The ever increasing sequencing 
capacities enabled by current sequencing machines hold a lot of 
promise as for the future applications of these technologies, but also 
create increasing computational challenges related to the analysis 
and storage of these data. A typical sequencing data file may occupy 
tens or even hundreds of gigabytes of disk space, prohibitively large 
for many users. Raw sequencing data consists of both the DNA 
sequences (reads) and per-base quality values that indicate the level 
of confidence in the readout of these sequences. Quality values 
account for about half of the required disk space in the commonly 
used FASTQ format and therefore their compression can significantly 
reduce storage requirements and speed up analysis and transmission 
of these data. 

Results: In this paper we present a framework for the lossy 
compression of the quality value sequences of genomic read files. 
Numerical experiments with reference based alignment using these 
quality values suggest that we can achieve significant compression 
with little compromise in performance for several downstream 
applications of interest, as is consistent with our theoretical analysis. 
Our framework also allows compression in a regime - below one bit 
per quality value - for which there are no existing compressors. 
Availability: http : //www. stanford.edu/-mainakch/svdbit .tgz 
Contact: asnanig Stanford, edu, dineshb@ Stanford, edu, 
raainakchgstanford.edu, iochoagstanf ord. edu, 
itaishgberkeley . edu, tsachyg Stanford. edu 

1 INTRODUCTION 

1.1 Background and Motivation 

It has been more than a decade now since the first draft of the human 
genome was published dLander et all. l200lh . The Human Genome 
Project, which required a significant collaborative effort of many 
scientists for more than 10 years, was completed using the Sanger 
sequencing technology and is estimated to have cost almost three 
billion dollars. Just a decade later, many medium and small size 
laboratories achieve the task of sequencing complete mammalian 
genomes within a few weeks using the new next generation 
sequencing (NGS) technologies. Read size is usually smaller for 



NGS sequencers compared to Sanger sequencing, but sequencing 
throughput is significantly higher. Current sequencers are capable 
of generating close to tera-base worth of data that needs to be 
stored and process ed. Several recent studies, such as the cow rumen 
jHess ef a/1 1201 ll) and the MetaHit dOin et a/.L 120101) metagenomic 
projects resulted with hundreds of hundreds of giga-base worth 
datasets. As project scales will continue to grow, it is expected that 
the bottleneck of projects involving massive sequencing will move 
towards the computational aspects, in particular with respect to the 
analysis and storage of the data. As a result, there is a growing 
interest in computational tools that can speed up processing and 
compressing this type of data. 

DNA Extraction 




Assemble Contiguous Fragments 




Gontigs 



Fig. 1. Essential rubrics of gene-sequencing. 



The sequencing process begins with the shearing of the input 
DNA into many short pieces, which are then prepared for 
sequencing, loaded onto the sequencer and sequenced. (Figure [T). 
Different methods are used by the different NGS technologies for 
the readout of the sequencing signal (also known as base calling). 
This process may be interfered by various factors, which may lead 
to wrong readout of the sequencing signal. In order to assess the 
probability for base calling mistakes, sequencers produce scores that 
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Fig. 2. Typical FASTQ record. We focus on lossily compressing the fourth 
line - the quality value string - of every record. 



reflect the level of confidence in the readout of each base. These 
scores are known as quality values and are part of the standard 
sequencing output. In the widely accepted FASTQ format, for 
example, each read is represented by four lines, of which two are 
for the reads themselves and a string of quality values for the read. 
The use of quality values depends on the use of the data. Low 
quality reads and read parts may be removed from the data prior 
to operations that require high- quality data such as the assembly of 
genomes, or mapping-based single nucleotide polymorphism (SNP) 
detection. Next we describe in detail the quality value sequences and 
the FASTQ format. 

1.2 FASTQ format 

We consider the compression of FASTQ files, because of their wide 
acceptance as a standard for storing sequencing data. A FASTQ file 
consists of separate entries for each read, each one consisting of the 
following four lines (see Figure[2]for example): 

(i) header line, always begins with an @ sign, followed by the 
name of the read 

(ii) r - the base pair sequence in the read, where r[i] £ 
{A, C, G, T, N} with N representing an unknown base. 

(iii) quality value header, begins with a + sign which may or may 
not be followed by the read name 

(iv) q - the quality value string for the sequence r. q[i] represents 
the quality value of base r[i]. 

Quality value q[i] represent the level of confidence in the readout 
of its corresponding base r[i], with high quality values representing 
greater confidence. Each quality value is encoded by an ASCII 
character in the FASTQ format based on one of a few accepte d 
schemes. One such standard is the Sanger scale dCock et allfiwd) . 
Quality values in this scale are computed as Q — 33 — 10 log 10 P, 
and range typically from 33 — 73. Lossless compression of read 
files will require, on average, 2 bits/symbol for the base sequence, 
and 6 bits/symbol for the quality values, i.e., three times as much 
storage as is required for the reads themselves. Base calling is an 
inherently noisy process by itself. Based on the amount of noise 
added by this process, it might be possible to achieve a significant 
reduction in the representation of these values with only a marginal 
loss of performance by neglecting a the portion of the quality value 
information that encodes mostly noise. Here we explore the use of 
lossy compression for achieving this goal. 

1.3 Related Work and Our Contribution 

The literature abounds in efforts to compress the genomic data. 
Several approaches exist for compression of whole genomes without 



1 19991). dChenefaZ. 



( Pinho et al. 



l2002h . dCaoefa/.Ll2007h . dSato etali l200lh 



, 1201 lbf) and references therein. 
Whole genome level compressi on without the aid of any external 
information has been the focus of l lChen et a/.L[l999h - dPinho et all 
1201 lbh and references therein. More recent contributions show 
that further compression can be achieved by mapping the target 
genome to a referen ce genome and e ncodi n g only the d if ference s 
between the two dChristlev et all l2008h.dPinho et all 1201 1<J) 
dWang and Zhand. 1201 ll), (Kuruppu et at, 2010), ( Kuruppu et al. , 
1201 lh . jHeathef fldl2O10h . dMa et a/.Ll2012h dChern et all \20m . 
Other approaches consider the problem of losslessly compressing 
the sequence reads together with t he corresponding qualit y values 
dDeorowi cz and GrabowskiL 1201 lh . dTembe et al. , l2010h . while 
in i l TimothTe7a/ri2*008h only the compression of the reads is 
considered. 

In this paper we concentrate only on the lossy compression of 
the quality values as they take up chunk of the storage space. 
Lossy compression of quality values have been considered in 
the litera ture only recen tl y. It has been presented as a plausible 
idea in dLeinonen et all 1201 ll) a nd a concrete a lgorith m as a 
part of SLIMGENE Package in dKozanitis etali 1201 lh . which 
considered the problem of compression of the reads, both of the 
base sequences and quality values sequences. Their compression 
package has a module which does a lossy compression of quality 
values, based on fitting a fixed state markov encoding model on 
adjacent gaps between quality scores. They use SNP variant calling 
as their performance metric and sh ow lossy compres sion has a 
minimal effect on performance. In dFritz et all 1201 lh . a metric 
called "quality-budget" is used to selectively discard the quality 
values which match perfectly to the reference, with o nly quality 
value s with sufficient variation being retained. Recently, dWan et all 
1201 lh . considered relative mapping accuracy of the reads as the 
metric and applied various lossy transform techniques to show 
that impressive compression can be achieved with similar mapping 
between uncompressed and compressed quality value files. The 
importance of lossy compression schemes in the context of the 
dif ferential archiving nee ds of sequenced data has been highlighted 
in dCochrane et allum$) . 

In this work, we use Rate Distortion theory to guide our 
construction of lossy compressors. While details are deferred 
to later sections, in general, Rate Distortion theory pertains to 
compressing or representing an information source with a certain 
number of bits per source component (rate) while minimizing 
the distortion between the source and the reconstruction based on 
said representation, as measured by a given fidelity criterion. As 
already alluded to, due to their inherently noisy nature, some of 
the quality values can be discarded or represented in a coarser 
resolution without incurring much performance loss for downstream 
applications. That is, that an appropriately quantized file should 
give a performance at a downstream application comparable to 
the unquantized or original file. The distortion between the 
uncompressed (i.e., original quality values) and compressed source 
(i.e., the reconstructed quality values after the lossy compression) 
data is a mathematical quantity such as hamming loss, square 
loss, etc, rather than "physical distortion" or performance loss 
due to lossy compression with respect to downstream applications 
such as Alignment, SNP Calls, etc. Hence, the question arises, 
can we compress the quality value file to a certain rate (read, 
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bits, etc.) without incurring much loss with respect to a standard 
distortion criterion and hope that a low distortion would imply 
little compromise in performance at downstream applications? In 
other words, does our fidelity criterion correspond well to "physical 
distortion"? 

Towards answering this question, we use mean square error 
as the distortion criterion for our lossy compression. We choose 
to work with this particular distortion criterion due more to its 
convenient analytical properties than to our belief that it is canonical 
to measuring the loss incurred in the downstream applications. 
Further, we model our source of the quality values in the read 
file as a multivariate gaussian. This is justified by both central 
limit arguments (as the sources of noise in the acquisition of 
the quality values are incremental and independent) and the fact 
that, given a vector source with a particular co-variance structure, 
the Gaussian Multivariate source is the least compressible and, 
further, a code designed under the Gaussian assumption will 
perform at least as well on any other source of the same covariance 
ilLapidothl [l995h . We then suggest a tractable scalar quantization 
algorithm and show numerically that it achieves mean square loss 
comparable to the optimum (that would be achieved using the 
optimal vector quantization). We further demonstrate that achieving 
low mean square error translates to comparable performance in 
the downstream applications as compared to use of the original 
(uncompressed) quality values. 

Our algorithm operates at any non-zero compression rate, and as 
far as we know is the first implementation of lossy compression 
of quality values that can accommodate less than one bit per 
quality value. We find reasonable performance in the downstream 
applications even in this low-rate regime. 

1.4 Organization of the Paper 

The paper is organized as follows. In Section [2] a problem 
formulation is provided with emphasis on how the quality values 
are modeled, the class of schemes which are considered, and the 
performance metrics used. Section [3] provides some background 
on Rate Distortion theory for memoryless sources. Our primary 
compression technique is transform coding via singular value 
decomposition (SVD), which we describe in Section|4] Experiments 
on real data are presented in Section [5] The paper is concluded in 
Section in|6]with directions for future work. 



2 PROBLEM FORMULATION 

We now formalize the problem of lossy compression of quality 
values and describe the general model. As discussed in Section 
11.11 quality values represent the reliability of a particular read 
base. The higher the quality value, the higher the reliability of 
the corresponding read base, and vice versa. More specifically 
quality value is the integer mapping of P (the probability that the 
corresponding base call is incorrect) and is represented in (at least) 
the following different scales/standards : 

• Sanger or Phred scale : Q — — 101og 10 P. 

• Solexa scale : Q = — 10 log 10 y^-p . 

The integer Q values are encoded in ASCII format, for the purpose 
of this work, and without loss of generality, we consider the 



Phred+33 in which the range of quality values is [0, 40]. A quality 
value QV is represented by the letter whose ASCII value is QV + 
33, resulting with letters in the ASCII range of [33, 73]. 

2.1 Modeling Quality Values 

We consider files with fixed read length, /. Denote the number of 
reads in the file by N. The quality values in a file are denoted by 
{Xi}fL u where X; = [Xi(l),--- ,Xi(l)]. In real data quality 
values take integer values in a finite alphabet, X, for example in 
Phred+33 scale, X = {33, 34, ■ • ■ , 72, 73}. However, for the 
purpose of modeling, we assume X — R (the set of real numbers). 
Each X, is modeled as independent and identically distributed 
jointly Gaussian random vector distributed as A/"(mx,£x)- The 
motivation for modeling the reads as Gaussian is already outlined 
in Section [T31 while independence assumption is supported by the 
fact that reads, in general, are randomly sampled from the genome 
in gene-sequencing step. 

2.2 Scalar Quantization 

The compression techniques which are applied and analyzed in this 
paper, can be modeled as in Fig. [3] 
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Fig. 3. Lossy Compression of Quality Values via scalar quantization. 



The quality value vectors Xi are i.i.d. as Px ~ A/"(/ix, £x), 
and each is mapped into decision regions representable in a finite 
number of bits. This is the quantization step represented by F(X;). 
Let the mapped (quantized) values so obtained be referred to by Ui. 

= (Ui, . . . , Ujv) are then losslessly described via a lossless 
encoder (such as a Huffman encoder, LZ encoder, etc.). The rate 
R of compression is the average number of bits per quality value 
used to describe the reads. This completes the compression step and 
is lossy in general due to the quantization step. The quantization 
could be as simple as truncating some entries, or rounding, or 
more sophisticated, such as transform coding which is the focus 
of this paper and will be described in Section [4] We refer to this 
as "scalar quantization" because each read is quantized separately, 
unlike vector quantization techniques where different reads are 
collected and jointly quantized. Vector quantization generalizes 
scalar quantization, but is harder to implement and typically has 
minor performance improvements. 

For reconstructing the quality value sequence, entropy decoding 
is used to losslessly reconstruct from its bit description, and 
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then the mapping G(-) is used to get a lossy estimate of X; as Xi. 
The normalized distortion satisfies 



Xj Xi 



i2JV- 
2 - 



/ 



-E II X-X 



(1) 



where the limiting behavior as N — > oo is due to the law of 
large numbers and E is the expectation over the statistics of source 
X. There is obviously a tradeoff between rate and distortion, 
which depends on the source X and is quantified via the scalar 
rate-distortion function, _R' S '(D,X) or distortion-rate function 
D^ S '(R, X) (the superscript s indicates scalar quantization). 
Henceforth, the compression model in the paper is that of Fig. [3] 
which is a special case of the general compression architecture that 
Rate Distortion theory allows, as is briefly reviewed in Section [3] 
We demonstrate empirically in Section [4] that scalar quantizers for 
our data work as well as any vector quantizer. Also, from now on the 
terms 'quantization' and 'compression' are used interchangeably. 

2.3 Performance Metrics 



As discussed in Section 11.31 the mean square distortion 
is mathematically convenient. Of interest in practice is the 
deterioration in performance when using the lossily reconstructed 
quality values relative to a file containing the original quality values. 
A genome sequence read file, i.e., the FASTQ file can be utilized 
in a variety of applications by a number of downstream analyzers 
in genomics. However, almost all such downstream applications 
would eventually depend on the alignment profile of the reads. 
This alignment may be either de novo or reference based. We 
consider reference based alignment here and describe two natural 
performance metrics that will be used for comparison of our scheme 
with other schemes. They are : 

1. Relative Mapping Accuracy: Let us denote a typical base 
sequence present in the fastq file by r. The position where 
r maps to the reference is denoted by V(r). Let A = 
{(r,V(r)) : r e FASTQ(X)} or the file containing the 
original uncompressed quality values. From now on with 
some benign abuse of notation we will abbreviate the original 
fastq file with uncompressed quality values simply as the 
uncompressed fastq file. Similarly we denote A = {(r,V(r)) : 
r 6 FASTQ(X)} for the compressed fastq file. The relative 
mapping accuracy is simply 



|.4n.4| 



(2) 



1.4 n ,4| + \A\A\ 

where n stands for intersection and A\B denotes the elements 
that belong to A but not to B. 

In other words, the relative mapping accuracy measures what 
percentage of original (uncompressed) reads have mapped to 
the same position on the reference sequence with the lossily 
compressed quality values. 

Symmetric Difference: This is simply 



|.4 \ .4| + |.4 \ .4| 



(3) 



|.4 n .4| + |.4 \ .4| 

This measures the percentage of reads that align to different 
positions on the reference sequence with the uncompressed and 
the lossily compressed quality values. 



Ideally, we would want our read file containing compressed quality 
values to give an alignment profile identical to that of the read file 
with the uncompressed quality values. In other words, we want a 
high relative mapping accuracy and a low symmetric difference. 



RATE DISTORTION THEORY 
PRELIMINARIES 



SOME 



In this section, we provide a brief background on Rate Distortion 
theory for mem oryless sources. For detail ed description and proofs 
please refer to l lCover and ThomasL fl99lh . We consider fixed rate 
schemes which are as follows. Referring to Fig. [4] our goal is 
to encode a source sequence of block length n, X n , using only 
nR bits, in order to minimize the distortion between the original 
source sequence and the reconstruction sequence, X , chosen by 
the decoder. We assume that our given distortion function d : 
(X, X) — > 1Z + operates symbol by symbol (as opposed to block by 
block) and that the distortion D is given byD = E [d(X n ,X")] = 

DEFINITION 1 . A rate-distortion scheme of rate R consists of the 
following: 



{1, 



1. An encoder, f n : X n 

2. A decoder, g n : {1,...,2" H } 

3. A reconstruction sequence, X 



, 2 nR }. 

+ x n . 

1 = <?«(/» (*"))■ 



DEFINITION 2. The pair (R, D) is said to be achievable ifVe > 
0, 3 n and a rate-distortion scheme at rate < R + e and (expected) 
distortion < D + e. 

DEFINITION 3. The rate-distortion function is defined as 

R(D, X) = inf{_R' : (R! , D) is achievable}. Similarly, we 

define the distortion-rate function as D(R,~K) = inf{Z)' : 
(R, D') is achievable}. 



ENCODER 


/„(*») 6 {1,2,. 


.,2»«} 


DECODER 
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Fig. 4. Rate-distortion problem 



TH EOREM 1 . Gaussian Memoryless Scalar Source \Cover and ThomasX 
\l99]\) : For an i.i.d. Gaussian scalar source X ~ Af(p,,o~), the 
rate-distortion and the distortion-rate functions are: 



R{D,X) 

D(R,X) = cr 2 2 



2 

2 n -2R 



ilog(^-)l {D<[T 2 } 



(4) 

(5) 



where ' s the indicator function that takes the value one when 

the event A is true and zero otherwise. 

THEOREM 2. Gaussia n Memoryless Ve ctor Source with 

independent components \Cover and ThomasX \l99l\) : For an 
i.i.d. Gaussian vector source (XJ, M (ji, Ex), with Ex = 



diag[o 1 , 



,o~i] (i.e., independent components), the optimal 
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distortion-rate tradeoff is given as the solution to the following 
optimization problem: 



D(R,X) = min eUi<^ 2 2~ 2p * 
p=[pi,--- ,pi] 

s.t. T, l i=1 pi < R. 



(6) 
(7) 



4 COMPRESSION VIA SINGULAR VALUE 
DECOMPOSITION (SVD) 

The general transform coding paradigm is shown in Fig. [5] If the 
source/signal is "compressible" in a particular domain, intuitively 
one should transform the source/signal to that domain. First the read 
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Fig. 5. The scalar quantization based on transform coding and differential 
bit-allocation. 



Gaussian components under our modeling assumption, the optimal 
value of the above Problem [8] is exactly equal to the optimal 
distortion rate function for a Gaussian source as outlined in Section 
[3] The solution to [8] dictates the number of bits that needs to be 
allocated to store Yj, Ideally this allocation should be done by vector 
quantization for the whole block together. However, due to ease 
of implementation and negligible performance loss, we use a scalar 
quantizer. Thus the component Yi is normalized to a unit variance 
Gaussian (the variances of each component are either known from 
the statistics of the read file or are estimated ) and then it is mapped 
to decision regions representable in pi bits. The decision regions 
and their representative values (stored in T> map (pi) for all possible 
pi) are found from a Lloyd Max procedure on a scalar Gaussian 
distribution, i.e., for pi bits the T> map (pi) will store 2 Pi regions 
(boundary points and representative value for each region) which 
would give the minimum mean squared error for a unit variance 
Gaussian. 



Algorithm 1 SVD-BiiAllocate(AT , R) 

Mx, Ex Empirical mean and covariance of Xi 
Compute SVD E x = V BVd SV? vi , S = diag(cr 2 , . . . , af) 
Precompute Lloyd Max quantizer V map for gaussians 
for i = 1 — ► N do 

V") 4 F sv( [^-i 

p* <— BitAllocate(S, R) 



end for 



Scalar-Quantization(V, nap , Yi, p) 



vector X (vector of length I), is decorrelated by a unitary operation 
matrix V (VV T = /) computed from the empirical statistics of X. 
The Bit Allocation block then allocates bits to each read position. 
This allocation is precomputed by using the statistics of the quality 
value file. Thus for each read, Y = VX is quantized by a scalar 
quantizer, to obtain bits U which are then finally compressed into 
bits by an entropy encoder. To obtain the 'quantized' read vector 
X, the compressed bit description is first entropy decoded and then 
demapped by the quantizer into decisions for each bit sequence 
followed by inverse transform through V T . 

SVD-BitAllocate: Here we perform transform coding as in Fig. 
□with V = V s T vd . 

The rate of bits allotted per quality value sequence is a user 
specified parameter. The source would be compressed accordingly. 
Thus we can formulate the bit allocation problem as a convex 
optimization problem and solve it exactly. That is, given a budget of 
R bits per read, we allocate the bits by first transforming X into Y 
(a decorrelating transform, which by the Gaussian nature makes the 
components independent) and then allocate bits to the independent 
components of Y by solving the following optimization problem, 

minimize p y^^ <r 2 2~ Pl 

i 

subject to S~] pi < R. (8) 

i 

The objective function here is the mean squared error per quality 
value. Hence the rate = 5^ 4=1 pi, where pi is the number of bits 
allocated to the i th component of Y. Since Y has independent 



function BitAllocate(5', R) 

min p i£!= 1 ^2- 2 ' 3 ' 

such that V^ i=1 pi < R 
end function 

function SCALAR-QUANTIZATION(X> map , Yj, S, p) 

for j — 1 — > I do 

Yi(J) <- ±Yi 

Ui(j) 4- Quantize Yi(j) using V map (pj) 

end for 
end function 



Once we get Uj , we may perform lossless compression using 
standard universal entropy coders. However, this was found to 
achieve negligible compression improvements over R bits per read, 
and hence was not considered in the numerical results, to which we 
now turn. 



5 RESULTS 

We present the results of numerical experiments with our algorithm 
on real read data. The data was downloaded f rom the NCBI human 
genome sequence read archive jNCBltl2012[) (reads with identifier 
ERR000531 are used). The total number of quality value sequences 
considered for the data presented is about 20 million, each sequence 
length (i.e., read length) being 46. We tested the mean squared error 
performance of the quan tized quali t y valu es from our algorithm 
against other algorithms l lWan et all 120111) Figure [6] The results 



5 



Asnani et al 



Mean Square Error comparison 
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Fig. 6. Mean squared error plots as a function of the number of bits allocated 
per position. The opti mal here refers t o the solution of Problem [8] Log 
Binning is proposed in (Wan et al, 2 01lh . 



Fig. 8. Fraction of size of symmetric difference over size of m apped reads 
with unquantized values. Log Binning has been proposed in jWan et all 
l201ll) . 
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Fig. 7. Percentage of reads which have been aligned to the same positions 
as the original u nquantized reads. Log Binning has been proposed in 
jWan et alWOvi) . 



show that for low number of bits per quality value position, our 
algorithm achieves much lower mean squared error c ompared to 
existi ng implementations (binning based quantizers et all 

l2oITh . At higher values, the degradation in mean squared error 
(MSE) comes from the fact that our modelling assumption works 
with continuous real numbers, hence the "optimal" mean square 
value does not vanish even with increasingly many bits. Thus, 
the log binning curve (which works with the integers directly) 
performs better and actually goes below the "optimal" MSE curve 
for sufficiently high number of bits. Note that with 6 bits per 
position, we can code for the quality values losslessly. 

We also show the alignment performance by a sequence read 
aligner (bowtie) with our quality value sequences. Figure UJ and 
[8] show the performance of the relative mapping accuracy and 



symmetric difference (defined in Equations [3] and |2j between 
reference based alignment using the quantized files as a function 
of the bits allocated per quality value position. This can influence 
several downstream applications like variant calling /SNP detection 
(Figure |7J. The plots show little performance loss even with very 
small number of bits per position. This also corroborates our 
overall claim that lower distortion with respect to mean square 
loss translates to comparable performance in the downstream 
applications. Further our framework allows us to work with less 
than one bit per quality value, which may prove invaluable in future 
applications where the number of reads and their lengths will be 
increased manyfold. Also note that for zero rate, we are using 
just the mean of the quality values over all the reads. Since this 
needs a constant storage cost, the amortized number of bits required 
to store this information for large numbers of reads is zero. The 
curves in Figure [7] suggest that even with this information, we can 
achieve performances much better than by discarding quality values 
altogether. 

The plots, as expected, show increasingly better match with 
higher rates, with reconstructions using the original quality value 
sequence. This is due to the fact that the alignment performance 
has been compared to the uncompressed values. However, in 
accordance with our conjecture to be studied in future work, the 
curve measuring the 'true' performance with respect to the yet 
unknown 'ground' truth may not be monotone with increasing rate, 
as limiting the rate may denoise the data and hence enhance the 
accuracy. 

6 DISCUSSION 

We have presented a scheme for lossy compression of the quality 
value sequences arising in genomic data. By directly allocating bits 
to the most significant variations, our scheme simply and effectively 
captures the information in the quality value sequence given limited 
storage resources. While refinements such as the use of clustering 
ideas to learn the statistics of the reads more finely would likely 
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result in improved performance in the downstream applications 
(as compared to using the original quality value sequence), we 
suspect, based on preliminary observations, that our scheme may 
also achieve some form of denoising. Our thesis is that appropriate 
lossy compression of the quality values may result not only in 
improved compression ratios, but also in improved performance 
in the downstream applications, such as improved accuracy in 
sequence assembly that would be based on the lossy rather than 
the original version of the quality values. This prospect and its 
applications in high volume read sequencing is an exciting direction 
for future investigations. 
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